In [1]:
%autosave 10


Autosaving every 10 seconds

Intro

  • Used by search engines such as Google and Yandex.
  • Many machine learning competition winners use it.

Basics 101

  • Training data: a set of examples (samples).
  • Each example has n features.
  • Response is real-valued (regression) or -1/+1 (classification).
  • Goal
    • Find a function that minimises error on unseen data.

Decision trees

  • Classification and Regression Trees (CART), Breiman et al., 1984.
  • Binary trees that split features on thresholds; output is real-valued (regression).
  • sklearn.tree.DecisionTreeClassifier/Regressor (see the sketch below).
  • Leaves contain constant predictions.
  • Decision trees are very interpretable.
    • They can be plotted and inspected.
  • But they have very poor predictive performance.
    • Seldom used alone.
    • Usually used in ensembles (random forests, bagging, boosting).
    • sklearn.ensemble
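
A minimal sketch of fitting a single CART regressor; the dataset and parameter values here are illustrative, not from the talk:

In [ ]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# toy regression problem (dataset choice is only for illustration)
X_toy, y_toy = make_regression(n_samples=1000, n_features=10, random_state=0)

# a single CART: binary, axis-aligned threshold splits, constant value per leaf
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_toy, y_toy)
tree.predict(X_toy[:5])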

GBRTs

Advantages

  • Can work on features with different scales (heterogeneous data).
    • By contrast, problems like face detection or text classification have features on similar scales.
  • Can swap in different loss functions.
    • e.g. robust loss functions like Huber.
  • Captures non-linear feature interactions.
    • No need to encode prior knowledge in a kernel (as with SVMs).
  • Not a black box (unlike SVMs or neural networks).

Disadvantages

  • Lots of tuning.
  • Slow to train (but fast to predict).
  • Like other tree-based methods, it cannot extrapolate (see the sketch below).
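
A small illustration of the no-extrapolation point, using a toy example of my own (not from the talk): a tree trained on inputs in [0, 10] predicts the same constant for anything beyond that range.

In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# train on a linear trend over [0, 10]
X_lin = np.linspace(0, 10, 200).reshape(-1, 1)
y_lin = 2.0 * X_lin.ravel()
tree = DecisionTreeRegressor(max_depth=4).fit(X_lin, y_lin)

# beyond the training range the prediction stays at the last leaf's constant
tree.predict([[10.0], [20.0], [100.0]])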

Boosting

  • AdaBoost
    • Each member of the ensemble is an expert on the errors of its predecessor.
    • Iterative: reweight the training samples based on errors.
    • sklearn.ensemble.AdaBoostClassifier/Regressor (see the sketch below).
  • The Viola-Jones face detector (2001) used it very successfully; a seminal application.
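
A hedged sketch of AdaBoost in scikit-learn; the dataset and settings below are my own choices, not the presenter's:

In [ ]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_hastie_10_2

X_h, y_h = make_hastie_10_2(n_samples=10000)

# each new shallow tree focuses on the examples its predecessors got wrong,
# via iterative reweighting of the training samples
ada = AdaBoostClassifier(n_estimators=200)
ada.fit(X_h, y_h)
ada.score(X_h, y_h)  # training accuracy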

Gradient boosting (J. Friedman, 1999)

  • Generalises boosting to arbitrary loss functions.
  • sklearn.ensemble.GradientBoostingClassifier/Regressor
  • Written in pure Python/NumPy, easy to extend.
  • Uses very shallow trees with a custom node splitter and pre-sorting.

In [2]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=10000)
est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
est.fit(X, y)

pred = est.predict(X)
est.predict_proba(X)[0]  # class probabilities


Out[2]:
array([ 0.0325271,  0.9674729])

In [4]:
import matplotlib.pyplot as plt

# plot the ensemble's prediction after each boosting stage
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)


KeyboardInterrupt
  • As you add more trees (stages), the fit to the training data keeps improving, but overfitting creeps in.

In [ ]:
# X_test / y_test: held-back data (e.g. from a train/test split)
import numpy as np
import matplotlib.pyplot as plt

n_estimators = len(est.estimators_)
test_score = np.empty(n_estimators)
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)  # loss on held-out data at each stage
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')
plt.legend()

Tree structure

  • max_depth controls the degree of feature interactions.
    • e.g. for geo data you need at least 2 to capture longitude/latitude interactions.
  • Friedman suggests a max depth of 3-5; the presenter uses 3-6.
  • min_samples (e.g. min_samples_leaf) requires sufficient samples per leaf; it adds a constraint (more bias), making the model more general (see the sketch below).
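
A sketch of how these knobs appear on the estimator; the values are illustrative, not recommendations from the talk:

In [ ]:
from sklearn.ensemble import GradientBoostingRegressor

# max_depth bounds the order of feature interactions each tree can model;
# min_samples_leaf adds a constraint (more bias), making trees more general
est_tree = GradientBoostingRegressor(max_depth=4, min_samples_leaf=9)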

Shrinkage

  • Learn slowly using a small learning_rate, but this needs a higher n_estimators.
  • Takes longer to train, but lowers the test error and the gap between train and test error (see the sketch below).
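
Shrinkage in code: a small learning_rate paired with a larger n_estimators (the numbers below are only a sketch):

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

# each tree's contribution is scaled down by learning_rate ("shrinkage"),
# so more estimators are needed, but the train/test gap usually narrows
est_slow = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.05, max_depth=3)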

Stochastic Gradient Boosting

  • subsample: fit each tree on a random subset of the training set.
  • max_features: consider a random subset of the features at each split.
    • The presenter recommends starting with just this one.
  • Increases accuracy (see the sketch below).
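
Both randomisations as constructor arguments; the specific fractions are illustrative, not from the talk:

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

est_stochastic = GradientBoostingClassifier(
    n_estimators=1000,
    subsample=0.5,     # each tree is fit on a random half of the training rows
    max_features=0.3)  # each split looks at a random 30% of the features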

How to tune hyperparameters (best practices)

  1. Set n_estimators as high as possible, e.g. 3000.
  2. Tune the other hyperparameters via grid search (see the sketch after this list).
    • Build a param_grid.
    • gs_cv = GridSearchCV(est, param_grid).fit(X, y)
    • gs_cv.best_params_
    • Can also parallelise with joblib (n_jobs).
  3. Set n_estimators even higher and re-tune learning_rate.
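
A runnable sketch of the recipe; the param_grid values are illustrative, and the import path assumes a recent scikit-learn where GridSearchCV lives in sklearn.model_selection (older versions used sklearn.grid_search):

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

est = GradientBoostingClassifier(n_estimators=3000)
param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17]}

# n_jobs=-1 parallelises the search across cores via joblib
gs_cv = GridSearchCV(est, param_grid, n_jobs=-1).fit(X, y)
gs_cv.best_params_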

Case study

  • GBRT can directly minimise Mean Absolute Error (MAE).
  • Some methods, like random forests, act on MAE through sum of squares as a proxy.
    • Squared error emphasises outliers, which increases MAE.
  • GBRT can capture the interaction between latitude and longitude in geo coordinates.
  • est.feature_importances_ lets you peek into the black box; plot it to see the most relevant features, which is great for the exploratory phase (see the sketch after the snippet below).
    • But it does not say how features interact with each other.
    • Use partial_dependence for partial dependence (PD) plots.
    • sklearn.ensemble.partial_dependence
    • Very convenient, and the computation is cheap.
    • Automatically detects spatial effects.

from sklearn.ensemble import partial_dependence as pd

# X_train and names are the case study's training data and feature names
features = ['foo', 'bar']
fig, axs = pd.plot_partial_dependence(est, X_train, features, feature_names=names)
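
And a sketch of plotting the feature importances themselves; the plotting choices are mine, not from the talk:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# relative importance of each feature, accumulated over all splits in the ensemble
importances = est.feature_importances_
order = np.argsort(importances)
plt.barh(np.arange(len(order)), importances[order])
plt.yticks(np.arange(len(order)), order)  # feature indices; use real names if available
plt.xlabel('relative importance')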

  • Very flexible, general, non-parametric
  • Solid, battle tested

Questions

  • Reference for heuristics?
    • The R package gbm is great; it is heavily referenced and a source of heuristics.
  • Why this house pricing case study?
    • Just general interest.
  • Prediction of time-series?
    • A general problem for tree-based methods.
    • Try to de-trend the data beforehand.
    • To be honest, the presenter hasn't been exposed to time-series prediction problems.
    • Maybe transform the data into the predictions of each tree, then run a linear model (!!AI ?)
  • What if you want to do classification?
    • Maybe use spline regression to transform the problem into a regression problem.